NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462
NODEJS-681: ControlConnection Concurrent Read and Write on .host and .connection#462toptobes wants to merge 5 commits into
Conversation
| assert.strictEqual(cc.hosts.length, 1); | ||
| }); | ||
|
|
||
| it('should not break when refreshing concurrently', async () => { |
There was a problem hiding this comment.
may need a better heuristic for ensuring the refreshes are okay... not sure... I just "borrowed" this from Jane's original PR
There was a problem hiding this comment.
Pull request overview
This PR addresses NODEJS-681 by adding concurrency protection around ControlConnection._refresh() to avoid concurrent refresh executions that can lead to inconsistent .host / .connection state, and adds an integration test intended to exercise concurrent refresh behavior.
Changes:
- Add
_refreshInProgressguard and refactor refresh logic into_unsafeDoRefresh()inControlConnection. - Add an integration test that triggers many
_refresh()calls. - Make
promiseUtils.toBackground()tolerateundefined/nullinputs via optional chaining.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| lib/control-connection.js | Adds a refresh-in-progress guard and refactors refresh implementation into a separate method. |
| lib/promise-utils.js | Makes toBackground() no-op safely when given a nullish value. |
| test/integration/short/control-connection-tests.js | Adds a concurrency-focused integration test for control connection refresh. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
SiyaoIsHiding
left a comment
There was a problem hiding this comment.
I think, the main reason why it broke in the past, is that this.host and this.connection are a nullable type, (they are supposed to be null in certain circumstainces like refreshing), but function calls like this._setHealthListeners(this.host, this.connection); are not treating them as nullable. Imagine if this is TypeScript, it would complain that _setHealthListeners expects Host, Connection as argument while this call passes Host?, Connection?.
So, aside from the _refreshInProgress flag to make sure only one refresh is happening at one time, I think we should still make sure we are using this.host and this.connection as nullable.
That means this._setHealthListeners(this.host, this.connection); should be
if (this.host && this.connection){
this._setHealthListeners(this.host, this.connection);
}
And other places that access this.host and this.connection should also has null guards, like this.
What do you think?
As far as I can tell, the nullability is meant to be a transient state and trying to impose these silent null checks may just cause more subtle bugs than it solves? I wouldn't be against explicit null checks in in an ideal world we may want to use an explicit state machine but that seems pretty out of scope for this PR |
|
Why
? |
Just to step back for a second we need to decide if it's even valid to get to the end of I don't think it should be, no? Meaning if it happens, we're just enabling the bug to propagate silently instead of catching it and yelling to the user that that's happened and needs to be patched (which it really should never happen with this refresh fix or there's a bigger issue) |
1ca12ed to
91a7b81
Compare
|
No it shouldn't reach the end of _refresh without |
|
This test failure is new and it concerns me. After some investigation, I think the problem is that a schema refresh triggered by an event can be permanently lost if the CC's connection is down at that moment and cannot establish new connection within 2 seconds. And this singleton refresh implementation makes it more vulnerable than before. What might have happened: In CI, a brief TCP hiccup (or Cassandra-side connection reset) closed the CC's socket during the 100ms debounce window after the ALTER KEYSPACE event:
In the past, when we accidentally allowed concurrent Fix Instead of allowing concurrent refresh, I think we should fix the problem of schema refresh errors being swallowed. For example, in // Instead of: toBackground(this.handleSchemaChange(event, false))
this.handleSchemaChange(event, false).catch(() => {
// CC will reconnect; re-queue this event once it does
this.once('newConnection', (err) => {
if (!err) toBackground(this.handleSchemaChange(event, false));
});
}); |
|
Bret points out that the Java driver's behavior is that every time a new control connection is established, it refreshes schema It makes sense to me that Node.js driver can align with Java driver's behavior. |
|
After some deliberation we think we can just call |
|
Confirming above is correct.
|
SiyaoIsHiding
left a comment
There was a problem hiding this comment.
This test still fails
1) ControlConnection
#init()
should subscribe to SCHEMA_CHANGE events and refresh keyspace information:
Error: Condition still false after 100 attempts: () => !cc.metadata.keyspaces['sample_change_1']
at whilstItem (test/test-helper.js:769:23)
at Timeout.next [as _onTimeout] (lib/utils.js:1042:5)
at listOnTimeout (node:internal/timers:605:17)
at process.processTimers (node:internal/timers:541:7)
I think it's possible that
- When the driver is selecting keyspace
sample_change_1info, that keyspace still existed in the DB, so DB sends the response of keyspace info, but it does not reach the driver yet. - The DROP statement executes.
- Driver
delete metadata.keyspaces['sample_change_1'] - The
sample_change_1keyspace info finally reaches the driver - Driver assigns
cc.metadata.keyspaces['sample_change_1']
I need to think about how to fix this..... But I think this path is possible to fail this test.
| // Don't attempt to reconnect when the ControlConnection is being shutdown | ||
| if (self._isShuttingDown) { | ||
| // Don't attempt to reconnect when the ControlConnection is being shutdown | ||
| this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown'); |
There was a problem hiding this comment.
| this.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown'); | |
| self.log('info', 'The ControlConnection will not be refreshed as the Client is being shutdown'); |
| this.metadata.initialized = true; | ||
| } | ||
|
|
||
| this.metadata.initialized = true; |
There was a problem hiding this comment.
nit: this will have the side effect that if metadata sync is not enabled, this.metadata.initialized = true; will still run everytime it refreshes, not just once at initialization.
This PR superceeds Jane He's #430 with her blessing
After some investigation, we were unable to figure out the root cause behind the NPEs, with there being multiple potential avenues where the issue may have originated from, and so we decided to fix the issue at the lowest and simplest level we could–simply adding a stronger concurrency control to
_refreshdirectly via a_refreshInProgressflagI personally believe the issue stemmed from
_setHealthListenersbeing called multiple times on the same host/connection, causing the listeners to trigger refreshes multiple times for the same event, leading to the NPEs mentioned in the ticket.However the issue is quite hard to organically reproduce so the theory remains a theory.
Potential trace
which means that there's the potential of, sequentially: